Experiments towards a better LVCSR system for Tamil
Abstract
This paper summarizes our latest efforts in developing a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Tamil at three levels: the pronunciation dictionary, language modeling (LM), and the front-end. Few word-pronunciation pairs are typically available for Tamil to train data-driven grapheme-to-phoneme (G2P) converters. We therefore explore how the amount of training data correlates with G2P conversion performance. To address the morphological complexity of Tamil, we investigate different levels of morphemes for language modeling, including a comparison between our Dictionary Unit Merging Algorithm (DUMA) and Morfessor, followed by various experiments on hybrid systems that use word and morpheme LMs. Finally, we integrate our multilingual bottle-neck features framework with Tamil LVCSR. Our best final system achieves a Syllable Error Rate (SyllER) of 21.34% on our Tamil test set.
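The abstract reports results as a Syllable Error Rate but does not spell out how it is scored. A minimal sketch follows, assuming SyllER is computed like a word error rate but over syllable tokens: the Levenshtein distance between reference and hypothesis syllable sequences, divided by the reference length. The function names and the romanized syllables in the example are hypothetical and serve only as an illustration, not as the authors' scoring procedure.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 100 * edit distance / number of reference syllables."""
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical romanized syllables: one substitution in four syllables.
    reference = ["va", "Nak", "kam", "ta"]
    hypothesis = ["va", "Nak", "kum", "ta"]
    print(f"SyllER: {syllable_error_rate(reference, hypothesis):.2f}%")
```

Scoring at the syllable level rather than the word level keeps the metric comparable between word-based and morpheme-based systems, since both outputs can be broken into the same syllable units before alignment.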
Similar resources
Experiments towards a Multi-language LVCSR Interface
This paper describes experiments towards a multilanguage human-computer speech interface. Our interface is designed for large vocabulary continuous speech input. For this purpose a multilingual dictation database has been collected under GlobalPhone, which is a project at the Interactive Systems Labs. This project investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, ...
Issues in developing LVCSR System for Dravidian Languages: An exhaustive case study for Tamil
Research in the area of Large Vocabulary Continuous Speech Recognition (LVCSR) for Indian languages has not seen the same level of advancement as in English, since there is a dearth of large-scale speech and language corpora even today. Tamil is one of the four major Dravidian languages spoken in southern India. One of the characteristics of Tamil is that it is morphologically very rich. This qual...
Multilingual and Crosslingual Speech Recognition
This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Tu...
Handling Non-native Speech in LVCSR: A Preliminary Study
In moving towards full incorporation of CSR in applications whose users include non-native speakers, an understanding of how the system can be modified to increase its tolerance to non-native idiosyncrasies such as accented pronunciation and disfluent form is essential. While experiments geared towards restricted-use systems have suggested that extremely simple techniques are effective, prelimin...
Towards better language modeling for Thai LVCSR
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes us...